Cardiovascular Disease Prediction

Problem Statement

Patient readings are given, and the objective is to build an application that classifies patients as healthy or suffering from cardiovascular disease (CVD) based on the given attributes. People with high readings of CVD risk factors, or with a high risk of cardiovascular disease, need early detection and management, where machine learning models can be of great help.

Data Definition

The dataset represents features of patients. The data definition is as follows:

----Age: age of the patient (integer)

----Height: height of the patient in cm (integer)

----Weight: weight of the patient in kg (float)

----Gender: gender of the patient (categorical code) 1-women, 2-men

----Systolic blood pressure: ap_hi (int)

----Diastolic blood pressure: ap_lo (int)

----Cholesterol: cholesterol level 1: normal, 2: above normal, 3: well above normal

----Glucose: glucose level 1: normal, 2: above normal, 3: well above normal

----Smoking: whether the patient smokes or not (binary) 1-smokes, 2-doesn't smoke

----Alcohol intake: whether the patient consumes alcohol or not (binary) 1-alcoholic, 2-non-alcoholic

----Physical activity: whether the patient actively participates in physical activity or not (binary) 1-does physical activity, 2-doesn't do physical activity

----Presence or absence of cardiovascular disease: target variable (binary) 1-presence of the disease, 2-absence of the disease

Table of Contents

  1. Import Libraries
  2. Read Data
  3. Exploratory Data Analysis
  4. Classification Models
  5. Hierarchical Clustering
  6. DBSCAN
  7. Visualize the Clusters

1. Import Libraries

Import the required libraries and functions.

2. Read Data

3. Exploratory Data Analysis

3.1 Understand the Dataset

Dimensions of the data

There are 70000 observations and 13 columns in the dataset.

3.2 Data Type

1. Check for the data type

Explicit type casting has been done for the variables that are stored as int but should be object. The target variable is not converted.
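The casting step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: the three-row sample is hypothetical, and the column names (`gender`, `cholesterol`, `gluc`, `cardio`) are assumed from the data definition and the dummy-variable names used later.

```python
import pandas as pd

# Hypothetical sample; the column names are assumptions based on the data definition.
df = pd.DataFrame({
    "gender": [1, 2, 1],
    "cholesterol": [1, 3, 2],
    "gluc": [1, 1, 2],
    "cardio": [1, 2, 1],
})

# The categorical codes are stored as integers. Cast them to object so they
# are treated as labels rather than quantities. The target stays as int.
categorical_cols = ["gender", "cholesterol", "gluc"]
df = df.astype({col: "object" for col in categorical_cols})

print(df.dtypes)
```

Leaving the target as an integer keeps it directly usable by the classifiers built later.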

3.3 Distribution of Variables

Check the distribution of all the variables.

The data contains both continuous and categorical variables. We plot a boxplot for each continuous variable to check its distribution and to identify outliers. We plot a countplot for each categorical variable to get the count of each category.

The above boxplots show that the variables 'ap_hi' and 'ap_lo' are not normally distributed, while the other variables are nearly normally distributed.

Also, it can easily be seen that all the variables have outliers.

The above barplots/countplots give the count of each label in the data. The two classes of the target variable cardio have almost equal counts.

3.4 Feature Engineering

A new column "BMI" (Body Mass Index) is created from the height and weight variables, and the gender column values are recoded to 0 and 1.
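A minimal sketch of this step, assuming the standard BMI formula (weight in kg divided by the square of height in metres) and columns named `height`, `weight` and `gender`. The sample rows are hypothetical, and which original gender code maps to 0 or 1 is an assumption, since the text does not say.

```python
import pandas as pd

# Hypothetical sample rows; height is in cm and weight in kg, as in the data definition.
df = pd.DataFrame({
    "height": [168, 156, 165],
    "weight": [62.0, 85.0, 64.0],
    "gender": [2, 1, 1],
})

# BMI = weight (kg) / height (m)^2, so convert height from cm to m first.
df["BMI"] = df["weight"] / (df["height"] / 100) ** 2

# Recode gender from 1/2 to 0/1. The direction of the mapping is an assumption.
df["gender"] = df["gender"].map({1: 0, 2: 1})

print(df)
```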

3.5 Univariate Analysis

The skewness is almost 0, so the age variable is approximately normally distributed.
The outliers have been removed from the age variable.
The height variable is slightly negatively skewed and has a lot of outliers. After removing the outliers, of which there were 523, the variable looks approximately normal.
The weight variable is positively skewed and has a lot of outliers. The upper range of weight is important for the analysis, so only the values below the lower limit are dropped. 45 outlier values have been removed.
The systolic blood pressure variable is highly positively skewed and has a lot of outliers. After the outlier treatment, the skewness is drastically reduced and the distribution is nearly normal.
The diastolic blood pressure variable is highly positively skewed and has a lot of outliers. After the outlier treatment, the skewness is drastically reduced and the distribution is nearly normal.
The BMI variable is positively skewed and has outliers. After the outlier treatment, the skewness is drastically reduced and the distribution is nearly normal.
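The outlier treatment described above can be illustrated as follows. The text does not state which rule was used, so this sketch assumes the common 1.5 x IQR fences. The helper `remove_outliers_iqr` and the synthetic `ap_hi` values are hypothetical. The `drop_upper` flag mirrors the weight case, where only values below the lower limit are dropped.

```python
import numpy as np
import pandas as pd

def remove_outliers_iqr(frame, column, drop_lower=True, drop_upper=True, k=1.5):
    """Drop rows whose value in `column` lies outside the k * IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = pd.Series(True, index=frame.index)
    if drop_lower:
        keep &= frame[column] >= q1 - k * iqr
    if drop_upper:
        keep &= frame[column] <= q3 + k * iqr
    return frame[keep]

# Synthetic systolic readings with a few implausibly large values appended.
rng = np.random.default_rng(0)
ap_hi = np.concatenate([rng.normal(125, 15, 500), [900.0, 1100.0, 14000.0]])
sample = pd.DataFrame({"ap_hi": ap_hi})

cleaned = remove_outliers_iqr(sample, "ap_hi")

# Skewness before and after the treatment.
print("rows removed:", len(sample) - len(cleaned))
print("skew before:", sample["ap_hi"].skew())
print("skew after:", cleaned["ap_hi"].skew())
```

For a variable such as weight, calling `remove_outliers_iqr(df, "weight", drop_upper=False)` would keep the high values and drop only the low ones.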

Insights

  1. The gender column is 65% male and 35% female.

  2. 74% of patients have normal cholesterol levels and only 11% have high cholesterol levels.

  3. 84% of patients have normal glucose levels and only 7% have high glucose levels.

  4. 91% of patients don't smoke and only 8% smoke.

  5. 94% of patients are not alcoholic and only 5% are alcoholic.

  6. 80% of patients do physical activity and 19% do not.

3.6 Multivariate Analysis

The boxplots of the continuous variables show their relationship with the target variable 'Cardio'.

1. Patients with cardiovascular disease have an average age of 57.

2. Height shows no relationship with cardio, as both groups have similar heights.

3. Patients with the disease have a higher weight.

4. Patients with the disease have higher systolic and diastolic blood pressure.

The barplots of the categorical variables show their relationship with the target variable 'Cardio'.

1. Gender is evenly balanced between people with and without the disease.

2. Patients across the different cholesterol levels are prone to the disease.

3. Very few people with the disease are smokers.

4. Very few people who are alcoholic are prone to the disease.

5. Patients with cardio are comparatively less active.

3.7 Summary Statistics

The above output illustrates the summary statistics of the numerical variables.
The id column has all unique values, so we can ignore it for the analysis. The average age of the patients is 53 and the average BMI is 14 kg/m2.

The final shape of the data is 63104 rows and 12 columns after we remove the id column.

3.8 Missing Values

Let us plot the heatmap to visualize the missing values in the data.

The above plot shows that there are no missing values in the data.
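A numeric cross-check can complement the heatmap. The sketch below counts missing values per column with pandas. The sample frame is hypothetical and deliberately contains one missing value so that the output is not trivially zero, unlike the dataset described above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing age, only to make the check visible.
df = pd.DataFrame({
    "age": [50, 55, np.nan],
    "weight": [62.0, 85.0, 64.0],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Single flag: True if any cell in the frame is missing.
print("Any missing values:", bool(df.isnull().values.any()))
```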

3.9 Prepare the Data

Thus, we have scaled all the numeric features, dummy-encoded the categorical features, and stored the result in a dataframe 'features_scaled'.
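One way this preparation might look is sketched below. It assumes scikit-learn's `StandardScaler` for the scaling and `pd.get_dummies` with `drop_first=True` for the encoding. Both are assumptions, although dropping the first level is consistent with dummy names such as `cholesterol_2` and `cholesterol_3` that appear in the model output. The sample rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample; column names follow the data definition.
df = pd.DataFrame({
    "age": [50, 55, 48, 61],
    "ap_hi": [110, 140, 130, 150],
    "cholesterol": [1, 3, 2, 1],
    "gluc": [1, 1, 2, 3],
})
numeric_cols = ["age", "ap_hi"]
categorical_cols = ["cholesterol", "gluc"]

# Standardize the numeric features to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# Dummy-encode the categorical features, dropping the first level of each.
dummies = pd.get_dummies(df[categorical_cols], columns=categorical_cols, drop_first=True)

# Combine both parts into one frame.
features_scaled = pd.concat([scaled, dummies], axis=1)
print(features_scaled)
```

Because the numeric features are standardized, a one-unit change in a scaled feature corresponds to one standard deviation of the original variable.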

4. Classification Models

4.1 Build the Base Model-Logistic regression

* constant: the odds of a patient having CVD are 1/e**0.1300 when all the other variables are zero.
* age: 0.3453, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.3453 for a unit increase in age.
* height: -0.0406, it implies that the odds of an individual having cardiovascular disease decrease, being multiplied by a factor of e**-0.0406, for a unit increase in height. The effect is insignificant.
* weight: 0.1302, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.1302 for a unit increase in weight. The effect is insignificant.
* ap_hi: 0.9053, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.9053 for a unit increase in ap_hi.
* ap_lo: 0.1250, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.1250 for a unit increase in ap_lo.
* BMI: 0.0219, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.0219 for a unit increase in BMI. The effect is insignificant.
* gender_1: 0.0099, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.0099 when gender_1 changes from 0 to 1. The effect is insignificant.
* cholesterol_2: 0.3458, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.3458 when cholesterol_2 changes from 0 to 1.
* cholesterol_3: 1.0827, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**1.0827 when cholesterol_3 changes from 0 to 1.
* gluc_2: 0.0079, it implies that the odds of an individual having cardiovascular disease increase by a factor of e**0.0079 when gluc_2 changes from 0 to 1.
* gluc_3: -0.3636, it implies that the odds of an individual having cardiovascular disease decrease, being multiplied by a factor of e**-0.3636, when gluc_3 changes from 0 to 1.
* smoke_1: -0.1166, it implies that the odds of an individual having cardiovascular disease decrease, being multiplied by a factor of e**-0.1166, when smoke_1 changes from 0 to 1.
* alco_1: -0.2110, it implies that the odds of an individual having cardiovascular disease decrease, being multiplied by a factor of e**-0.2110, when alco_1 changes from 0 to 1.
* active_1: -0.2364, it implies that the odds of an individual having cardiovascular disease decrease, being multiplied by a factor of e**-0.2364, when active_1 changes from 0 to 1.
* The recall is 0.67 and the roc_auc_score is 0.79, which means logistic regression is not a bad model for this data, but we can try to find a model that performs better.
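The interpretation of the coefficients above rests on exponentiating each one to obtain an odds ratio. The short sketch below does this for four of the reported coefficients. A ratio above 1 means the odds go up and a ratio below 1 means they go down. The dictionary is only an illustration built from the values listed above, not output from the fitted model object.

```python
import math

# A few of the coefficients reported above (log-odds scale).
coefficients = {
    "age": 0.3453,
    "ap_hi": 0.9053,
    "cholesterol_3": 1.0827,
    "active_1": -0.2364,
}

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "higher" if ratio > 1 else "lower"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} odds)")
```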

4.2 Decision Tree

* We can observe that the accuracy score of the full model is 0.63 and the recall is 0.62, so it is not a good model for this data.
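For reference, the metrics quoted in this section (accuracy, recall, ROC AUC) can be computed with scikit-learn as sketched below. The labels and predicted probabilities are hypothetical and use 1 for presence of the disease and 0 for absence. The 0.5 threshold is also an assumption.

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

# Turn probabilities into class predictions with a 0.5 threshold.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of actual positives that were found
auc = roc_auc_score(y_true, y_prob)        # ranking quality, computed from probabilities

print(f"accuracy={accuracy:.2f} recall={recall:.2f} roc_auc={auc:.4f}")
```

Recall matters here because a missed case of cardiovascular disease (a false negative) is usually the costlier error.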

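from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hyperparameter grid to search; max_leaf_nodes must be at least 2
tuned_paramaters = [{'criterion': ['entropy', 'gini'],
                     'max_depth': range(2, 10),
                     'max_features': ["sqrt", "log2"],
                     'min_samples_split': range(2, 10),
                     'min_samples_leaf': range(1, 10),
                     'max_leaf_nodes': range(2, 10)}]

decision_tree_classification = DecisionTreeClassifier(random_state = 10)

# 5-fold cross-validated grid search over the grid above
DT_grid = GridSearchCV(estimator = decision_tree_classification, param_grid = tuned_paramaters, cv = 5)
DT_grid_model = DT_grid.fit(xtrain, ytrain)

# The best parameter combination is exposed as the best_params_ attribute
print('Best parameters for decision tree classifier: ', DT_grid_model.best_params_, '\n')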

We have formed three clusters, where there are 9041 observations in one cluster, and 5052 and 2631 observations in the other two clusters.

4.3 Random Forest

4.4 K-Nearest Neighbours

4.5 Naive Bayes

4.6 Adaptive Boosting